Workflow orchestration
Workflow orchestration in the ai-on-gke project involves managing and automating the execution of complex, multi-step processes, primarily for AI/ML workloads on Google Kubernetes Engine (GKE).
This tutorial guides you through fine-tuning the Gemma 3-1B-it language model on Google Kubernetes Engine (GKE) using an L4 GPU, leveraging Parameter Efficient Fine Tuning (PEFT) and LoRA. It covers setting up a GKE cluster, containerizing the fine-tuning code, running the fine-tuning job, and uploading the resulting model to Hugging Face. Finally, it demonstrates how to deploy and interact with the fine-tuned model using vLLM on GKE.
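As a rough illustration of the PEFT/LoRA step, the sketch below wraps a causal language model with a LoRA adapter using the Hugging Face peft library; the model ID, rank, and target modules are assumptions rather than the tutorial's exact configuration.

```python
# Minimal sketch of attaching a LoRA adapter for parameter-efficient fine-tuning.
# Model ID, rank, and target modules are illustrative assumptions, not the
# tutorial's exact configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3-1b-it"  # assumed model ID; gated, requires a Hugging Face token
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (assumption)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumption)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```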
This guide illustrates the deployment of Flyte on Google Kubernetes Engine (GKE) using Helm, utilizing Google Cloud Storage for scalable data storage and Cloud SQL PostgreSQL for a reliable metadata store. By the end of this tutorial, you will have a fully functional Flyte instance on GKE, offering businesses seamless integration with the GCP ecosystem, improved resource efficiency, and cost-effectiveness.
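Once Flyte is running on GKE, workflows are authored with the flytekit SDK and registered against the deployment. The toy workflow below only sketches that authoring model; the task names and logic are assumptions.

```python
# Minimal flytekit sketch: two tasks chained into a workflow that could be
# registered against the Flyte deployment on GKE. Task names and logic are
# illustrative assumptions.
from flytekit import task, workflow


@task
def preprocess(n: int) -> int:
    return n * 2


@task
def train(n: int) -> str:
    return f"trained on {n} samples"


@workflow
def pipeline(n: int = 100) -> str:
    return train(n=preprocess(n=n))
```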
This guide uses the Google Cloud API to create a Hyperdisk ML disk from data in Cloud Storage and then use it in a GKE cluster. Refer to this documentation for instructions that rely entirely on the GKE API.
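For orientation, one way to create the underlying volume with the google-cloud-compute Python client is sketched below; the project, zone, size, and disk name are placeholders, and populating the disk from Cloud Storage is a separate step covered in the guide.

```python
# Sketch: create a Hyperdisk ML disk with the Compute Engine Python client.
# Project, zone, size, and disk name are placeholder assumptions; loading the
# data from Cloud Storage onto the disk is a separate step described in the guide.
from google.cloud import compute_v1

project, zone = "my-project", "us-central1-a"  # placeholders

disk = compute_v1.Disk(
    name="model-weights",
    size_gb=200,
    type_=f"zones/{zone}/diskTypes/hyperdisk-ml",
)

client = compute_v1.DisksClient()
operation = client.insert(project=project, zone=zone, disk_resource=disk)
operation.result()  # block until the disk is created
```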
Deploying and managing servers dedicated to performing inference tasks for machine learning models.
In this tutorial, you will learn how to deploy a chatbot application using LangChain and Streamlit on Google Cloud Platform (GCP).
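A minimal sketch of how the two pieces fit together is shown below; the choice of ChatVertexAI as the LLM backend and the model name are assumptions, not necessarily the tutorial's stack.

```python
# Minimal Streamlit + LangChain chat sketch (app.py). The backing model and the
# use of ChatVertexAI are assumptions, not the tutorial's exact stack.
import streamlit as st
from langchain_google_vertexai import ChatVertexAI

llm = ChatVertexAI(model_name="gemini-1.5-flash")  # assumed backing model

st.title("GKE chatbot demo")

prompt = st.chat_input("Ask something")
if prompt:
    st.chat_message("user").write(prompt)
    response = llm.invoke(prompt)
    st.chat_message("assistant").write(response.content)
```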
This tutorial will guide you through creating a robust Retrieval-Augmented Generation (RAG) system using LlamaIndex and deploying it on Google Kubernetes Engine (GKE).
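At its core, the RAG pipeline looks roughly like the sketch below; the data directory and query are illustrative, and the tutorial's storage and model backends will differ.

```python
# Minimal LlamaIndex RAG sketch: index local documents and answer a query.
# The directory path and query are illustrative assumptions.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()  # assumed local folder
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What does this corpus say about GKE?"))
```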
This tutorial will provide instructions on how to deploy and use the Metaflow framework on Google Kubernetes Engine (GKE) and operate AI/ML workloads using Argo Workflows.
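For context, a Metaflow flow is plain Python like the sketch below, which could then be scheduled through Metaflow's Argo Workflows integration; the step contents are illustrative assumptions.

```python
# Minimal Metaflow sketch of the kind of flow you would run on GKE once
# Metaflow is configured; step contents are illustrative assumptions.
from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):

    @step
    def start(self):
        self.message = "hello from GKE"
        self.next(self.end)

    @step
    def end(self):
        print(self.message)


if __name__ == "__main__":
    HelloFlow()
```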
In this tutorial we will fine-tune gemma-2-9b using LoRA and track the run as an MLflow experiment. We will deploy MLflow on a GKE cluster and configure it to store artifacts in a GCS bucket. Finally, we will deploy the fine-tuned model using KServe.
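A minimal sketch of the tracking calls involved is shown below; the tracking URI, experiment name, and logged values are placeholders.

```python
# Minimal MLflow tracking sketch: log LoRA hyperparameters and a metric to the
# MLflow server deployed on GKE. URI, names, and values are placeholders.
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # placeholder URI
mlflow.set_experiment("gemma-2-9b-lora")

with mlflow.start_run():
    mlflow.log_params({"base_model": "google/gemma-2-9b", "lora_r": 8})
    mlflow.log_metric("eval_loss", 1.23)        # placeholder value
    mlflow.log_artifact("adapter_config.json")  # stored in the GCS artifact bucket
```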
This project allows you to download a Hugging Face model and package it as a Docker image. The Docker image can then be pushed to Google Artifact Registry for deployment or distribution. Build time can be significant for large models, so it is recommended to stay below roughly 10 billion parameters. For reference, an 8B model takes roughly 35 minutes to build and push with this Cloud Build config.
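The download step that runs inside the image build is conceptually similar to the sketch below; the model ID and target directory are illustrative assumptions.

```python
# Sketch of the model-download step that would run during the image build;
# the model ID and target directory are illustrative assumptions.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/gemma-2-9b",  # assumed model
    local_dir="/model",           # path baked into the Docker image
)
```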
This guide provides instructions for deploying and managing Ray custom resources on Google Kubernetes Engine (GKE) with TPUs. It details how to install the KubeRay TPU webhook, an admission webhook which bootstraps required environment variables for TPU initialization and enables atomic scheduling of multi-host TPU workers on GKE nodepools. This guide also provides a sample workload to verify proper TPU initialization and links to more advanced workloads to run with TPUs and Ray on GKE.
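As a rough sketch of what the sample workload verifies, the Ray task below requests TPU resources and inspects the TPU-related environment variables injected by the webhook; the resource count and exact variable names depend on your nodepool and webhook configuration.

```python
# Sketch of a Ray task requesting TPU resources on a KubeRay cluster; the
# resource count is an assumption tied to the nodepool's TPU topology.
import os
import ray

ray.init()  # connects to the existing KubeRay cluster when run inside it


@ray.remote(resources={"TPU": 4})  # assumed: 4 TPU chips per worker
def check_tpu_env():
    # Return the TPU-related environment variables the webhook injects.
    return {k: v for k, v in os.environ.items() if k.startswith("TPU")}


print(ray.get(check_tpu_env.remote()))
```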
In this tutorial, we will demonstrate how to leverage the open-source software SkyPilot to help GKE customers efficiently obtain accelerators across regions, ensuring workload continuity and optimized resource utilization.
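A minimal sketch of the SkyPilot Python API is shown below; the run command, accelerator type, and cluster name are assumptions.

```python
# Minimal SkyPilot sketch: describe a job that needs an accelerator and let
# SkyPilot find capacity. Run command, accelerator, and cluster name are
# illustrative assumptions.
import sky

task = sky.Task(run="python train.py")
task.set_resources(sky.Resources(accelerators="L4:1"))  # assumed accelerator

sky.launch(task, cluster_name="demo-cluster")
```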
This tutorial shows you how to serve a large language model (LLM) using both Tensor Processing Units (TPUs) and GPUs on Google Kubernetes Engine (GKE) using the same deployment with vLLM.
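For a feel of the vLLM API itself, the offline-inference sketch below uses the same Python interface whether the vLLM build targets GPUs or TPUs; the model ID is an assumption, and the GKE deployment in the tutorial uses the vLLM server container rather than this offline mode.

```python
# Minimal vLLM offline-inference sketch; the model ID is an assumption, and the
# GKE tutorial serves via the vLLM server container instead of this offline mode.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")  # assumed model
params = SamplingParams(temperature=0.7, max_tokens=64)

for output in llm.generate(["What is GKE?"], params):
    print(output.outputs[0].text)
```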